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Abstract 



When conducting inter-species regression analyses, the phylogenetic re- 
lationships between the individual species need to be taken into account. 
In this paper, a procedure for conducting such analyses is discussed, which 
only requires the use of a measure of relationship between pairs of species, 
rather than a complete phylogeny, and which at the same time assesses the 
importance to be attached to the relationships with regard to the conclu- 
sions reached. The procedure is applied to data from Minder et al. (2005), 
relating testis size to mean hind tibia length, duct length and spermathecal 
area in 15 species of Scathophagidae (Diptera). We show that considering 
the phylogenetic structure significantly improves the fit of the model to the 
data. We find a robust relationship between testis size and spermathecal 
area but could not support one between testis size and spermathecal duct 
length. 



Keywords: likelihood based inference, Ornstein-Uhlenbeck process, phyloge- 
netic relationship. 
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1 Introduction 



Comparative studies are a widely employed and powerful tool in evolutionary 
investigations. They have been used to elucidate macro-evolutionary patterns 
for many phenomena, including testis and sperm size evolution (e.g. Gage 1994; 
Hosken 1997), brain size evolution (e.g. Martin 1981; Pagel & Harvey 1989) and 
the scaling of metabolic rates (e.g. Thompson & Withers 1998; McNab 2002). 
Formerly, species values of characters of interest were regressed against putative 
predictor variables to elucidate possible relationships (e.g. Cummins & Woodall 
1985). However, because of common descent, species do not represent indepen- 
dent data points, and hence species level analyses based on simple regression 
analyses have been criticised (Harvey & Pagel 1991). This is not to say that phy- 
logeny has primacy of cause over other factors, merely that species level analyses 
may be misleading (Harvey 2000). For example, a simulation study by Martins & 
Garland (1991) investigated the across-species association between two variables, 
and found that the Type I error rate was 16%; when they employed phyloge- 
netic control, the error in the regression was reduced to 5%. This paper presents 
a procedure for conducting regression analyses in the presence of phylogenetic 
relationships, which also assesses the importance of these relationships in the 
analysis. The procedure is based on the Ornstein-Uhlenbeck model of the way 
in which inter-species differences evolve, and is closely related to the simplest 
version of Hansen's (1997) approach. 

In order to derive our procedure, we begin with a more detailed exposition 
of the underlying problem. In the classical regression model, the value y of 
the 'dependent' variable of interest is expressed linearly in terms of the values 
x( 2 \ ... of a number of explanatory 'covariables', up to an additive 'error' e, 
which accounts for any variation in y not attributable to the covariables. Thus, 
for each of n observations indexed by i, 1 < i < n, we write 

Vi = £<«>) + jp) x W + pVi x fi + ... + pW x W + ei: (i) 

where the f3^\ < j < k, are the coefficients which relate the values of the 
covariables to that of y, and the are needed because, in practice, it is usually 
impossible to find values P^°\P^\ . . . , (3^ such that 

Vl = + /jWxf) + p<a x ? > + . . . + pw x \* > (2) 

is exactly true for all i, if n > k + 2. The values of the parameters (3^ are 
estimated and their significance tested with reference to some probabilistic model, 
which is assumed to govern the values ei of the errors actually occurring; the 
simplest assumption is that the e*, 1 < % < n, arise as realizations of independent 
random variables £j, 1 < % < n, which have zero mean and common unknown 
variance a 2 , and are normally distributed. 

If the indices 1 < % < n in fact represent n species, as in the setting introduced 

(1) (k) 

above, and if the measured values yt and x\ , . . . , x] are 'typical' values for the 
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species, the assumption of independent errors may well be violated. This is 
because the variation in y unexplained by the can be thought of in part as 
resulting from evolutionary change in other, unobserved explanatory covariables, 
so that closely related species can be expected to exhibit rather similar values of e. 
Hence, when conducting regression analyses with such data, it seems important 
to take the phylogeny into account (Harvey & Pagel, 1991). 

The reasons for doing so are quite simple, and have long been understood 
in the context of quite general regression models. When the errors £j in such a 
model are in fact correlated, ordinary least squares (OLS) procedures still give 
parameter estimates which have the right expectation and are asymptotically 
consistent. However, as is especially relevant when the number of species is fixed 
and perhaps not large, their precision is less than that of the best estimates pos- 
sible for the actual correlation structure [Draper & Smith (1966: 80)]. Moreover, 
estimates of the precision of the OLS estimates, calculated in accordance with 
OLS assumptions, may be seriously in error [see, for example, Scheffe (1959: 339- 
343 and §10.4)]. In such cases, significance tests based on OLS are dangerous. 
In the particular context of inter-species regression, these features of the OLS 
analysis were noted by Pagel (1993). 

There is nonetheless still some debate about the efficacy of such phylogenetic 
control. This is primarily because the covariance of traits is explained by ecol- 
ogy and phylogeny, which typically overlap; hence, by controlling for phylogeny, 
variance due to ecology is also removed because of the overlap (McNab 2002). 
Despite such arguments, most investigators today use some form of phylogenetic 
control. 

A number of methods have been proposed for incorporating phylogeny into 
the regression, including trait mapping (Ridley 1985), nested analysis of vari- 
ance (Bell 1989, Stearns 1992), pairwise comparison (Felsenstein 1985, M0ller 
& Birkhead 1992), independent contrasts (Felsenstein 1985), and more directly 
Ornstein-Uhlenbeck based analyses (Hansen & Martins 1996): see also Lynch (1991) 
and Freckleton, Harvey & Pagel (2002). These methods all involve the use of a 
pre-existing phylogeny. Here, we propose a simple and effective procedure, which 
can either be applied using a known phylogeny, or else just using an inter-species 
distance matrix from which a phylogeny could potentially be constructed. The 
procedure has the advantage that it not only only respects the phylogeny, but also 
allows one to gauge its importance in the analysis. It also takes proper account 
of the (unknown) value at the root of the phylogeny. 

2 Procedure 

The basic idea is to return to the underlying assumption, that the e^'s result 
from a process of evolution along the branches of the phylogenetic tree. The 
evolutionary model that we use, which can be thought of as a natural generaliza- 



4 



tion of that of independent, normally distributed errors with common variance, 
supposes that the 'error' component evolves along each branch of the phyloge- 
netic tree as an Ornstein-Uhlenbeck (O-U) process, as discussed in some detail 
in Felsenstein (1988) and in Hansen & Martins (1996). The O-U process looks 
locally in time like a Brownian motion, as would naturally be the result of many 
small random genetic changes; however, it also has a tendency to return towards 
zero, which can be thought of as the result of selective pressure acting against de- 
partures from equality in (J2J), the strength of the tendency being larger for larger 
departures. For our purposes, the main features of the process are that it is a 
time-reversible Markov process, that its values are normally distributed, and that 
its autocorrelations decay exponentially with elapsed time. It is also important 
in acting as a good approximation to a wide variety of processes that result from 
the combination of random disturbances with a tendency to move back towards 
zero, in much the same way that the normal distribution is frequently a good 
approximation to sums of weakly dependent random variables. It thus represents 
a plausible null model for describing the 'errors' arising during evolution: see, 
for instance, Lande (1976). Its distributions are entirely characterized by the 
diffusion constant (infinitesimal variance) r 2 of its locally Brownian behaviour 
and by the exponential decay rate A of correlations; its equilibrium distribution 
is normal, with mean zero and variance a 2 = r 2 /(2A). We shall denote such a 
process by OU (r 2 , A). 

Our model supposes that an OU(r 2 ,A) process starts in equilibrium at the 
root of the phylogenetic tree, and runs, with time corresponding to distance along 
the branch, until the first split. At this point, its value is taken as the initial value 
for two independent OU (r 2 , A) processes, which then continue to run along the 
two branches until they split again; and so on. The species, the leaves of the 
trees, are assigned as values of the values of the OU (r 2 , A)-processes at the 
ends of the final n branches. This model of evolution along the tree results in 
values Ci realized from jointly normally distributed random variables E\,...,e n 
having equal variances a 2 , but now with correlations 

C u {\):= Con {e u e l )=e- Xd ^ (3) 

where du is the distance between species i and species I along the tree; see, 
for example, Hansen & Martins (1996: Equation (7)). Details are given in the 
Appendix. 

The value of the decay rate A, in combination with the values of the du, is 
seen from Q to determine the importance of the inter-species correlations in the 
analysis. However, the dependence between the errors is better understood from 
the following explicit representation. If two species % and I diverge at a time 
at which the value of the OU (r 2 , A) process for their common ancestral species 
takes the value Xo, and if the evolutionary distances to % and I from this time 
until the present are di and di, respectively, then the values of the O-U processes 
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X® and X® taken by the species % and / can be written as 



X (/) = X e~ Xdl +V il) . (4) 

Here, X , V"W and V® are independent, and the latter two random variables 
have zero means and variances 

and ^(1 - e^), 

respectively. The correlation between X^ and arises solely as a result of the 
elements containing X , the value at their common ancestral species, and these 
elements can be seen to decline at exponential rate A as evolutionary distance 
increases. In the extreme in which A — * 0, the elements containing X both 
remain constant at the value Xq, and the Brownian diffusion model with diffusion 
constant r 2 results. In the extreme in which A — > oo, the elements containing X 
both tend to zero, and the values X® and become independent, implying 
independent errors for each species. Thus, when fitting the OU(r 2 ,A) model to 
data, a best fit with large A indicates more or less independent species, and one 
with A very small indicates a Brownian-like model of evolution, a fact noted also 
by Blomberg et al. (2003) (with their parameter d corresponding to our e _A ). 
Neither limit is, however, entirely free of surprises: see the Appendix §4.2 for 
more details. 

Regression analysis based on this model is easy if the phylogenetic tree - 
specifically, all the tree-distances du between pairs of species — are known. Then, 
for any fixed A, < A < oo, the problem reduces to a generalized least squares 
analysis: the correlation matrix C(A) can be calculated, and maximum likelihood 
for the linear model with normally distributed errors and known correlation ma- 
trix can be used to find estimates ^ (X) , ^ (X) , . . . , (3^ (X) and <x 2 (A) of the 
remaining model parameters, together with L(X), the maximum value of the log- 
likelihood for this value of A. The value A to be used as an estimate of the true 
value of A is now obtained by maximizing L(X) iteratively with respect to A, for 
instance using a golden section search. This, as in Hansen (1997: 1345), yields 
the final parameter estimates 

A,/3(°)(A),...,^)(A),a 2 (A) 

for the regression. 

This rather simple procedure has an important drawback. If, for fixed r 2 , 
the value of A becomes very small, the variance a 2 = r 2 / (2A) of the equilibrium 
distribution becomes large, and, as a result, any particular observed value has 
correspondingly low likelihood. Indeed, for such A, all values are strongly related 
to the (unknown) value at the root, which is itself a single value chosen from the 
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equilibrium distribution. This leads to a large negative contribution to the log 
likelihood, reflecting nothing more than the potential variability in the value at 
the root, whose effects are clearly visible in Hansen (1997: Tables 1 and 2). It 
may seem unnatural to include this in a comparative study, in which the value 
of the overall mean (3^ is typically of little interest. There is also a companion, 
biological consideration; for very small A, it is doubtful whether enough time can 
have elapsed for the value at the root to have reached statistical equilibrium. 
In view of this, we prefer to centre all the covariables x^ l \ . . . at zero, and 
to base the analysis on the likelihood derived from the joint distribution of the 
centred y- values 

yi-y,...,y n -y, 

where y, as usual, denotes the overall mean of the observations. Because the 
covariables have been centred, the model now becomes 

yi - y = pWxP + pWx® + ■■■ + P W *l k) + ^ ( 5 ) 

where £j = £j — e. The overall mean (3^ no longer appears in the model, all 
other parameters have their original meaning, and the log likelihood no longer 
converges to — oo as A — > 0, but instead approaches that of the Brownian model. 
Other attempts to circumvent this difficulty, used in Blomberg et al. (2003) and in 
Butler & King (2004), are discussed in the Appendix, at the end of §4.1; neither 
seems to be entirely satisfactory. 

For each given A, the linear model theory also gives the standard deviations 
to be associated with the parameter estimates /3^(A), < j < k, and these can 
be used with A = A as reasonable approximations to the standard deviations of 
the estimates /3^(A), and hence for tests of hypotheses. However, A has been 
chosen from the data, and this source of variability is not included in such 'plug- 
in' approximations; simulating data samples from the model obtained from the 
estimated parameters, and then using an identical estimation procedure, gives an 
alternative way of judging the actual precision obtained, as well as indicating any 
possible bias. If the value of A is itself of interest, an approximate 95% confidence 
region based on large sample theory is given by the set of all A such that 

L(A)-L(A)<2 (6) 

[c.f. Edwards (1972: 80), Hansen (1997: 1345)]. This region may include A = oo, 
in which case an analysis that neglects inter-species correlations should still be 
reliable. Again, simulating data samples from the estimated model gives another 
measure of the variability in the estimates of A. 

In practice, the phylogenetic tree is never known precisely, complete with 
distances. However, the method proposed here can be expected to give useful 
results, even when the distances du are only approximately known; as long as the 
correlation structure is reasonably represented, gross errors in the conclusions 
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arising from this source should be avoided. Thus, if any molecular or morpho- 
logical data for the species are available, on the basis of which a tree can be 
reconstructed, this can be carried out, and the corresponding tree distances used 
for the du. 

Alternatively, this relatively difficult step can be avoided by using the mor- 
phological and molecular data to define a measure of distance between pairs of 
species — in any case, often the starting point for a tree construction — and 
by then using these 'raw' distances directly in place of the du. This procedure 
may seem controversial, but it should not be. If the raw distances are rather 
close to being tree distances, then the tree constructed from them should yield 
inter-species distances which are not very different, and the results of the pro- 
cedure will change correspondingly little. However, if the raw distances are not 
particularly tree-like, then the tree constructed from them may well yield rather 
different inter-species distances, but without any guarantee that they result in 
a more reliable picture of the actual correlations. Computationally feasible tree 
growing algorithms provide intelligent heuristics, but offer no guarantee of find- 
ing the correct phylogeny. However, using the raw distances, one at least has 
tangible data as input, rather than output from a black box, and our procedure, 
because of the freedom to choose the value of A, still gives a reasonable idea of 
how strongly relationship (expressed in terms of the raw distances) affects cor- 
relation. The phylogeny may not have been determined, but the analysis still 
makes reasonable allowance for inter-species similarity. 

In theory, there may be a problem if the raw distances are too far from being 
tree distances, because the resulting matrices C(A) need not then be positive 
semi-definite for all values of A, as has to be the case for correlation matrices. 
However, Bochner's theorem [Defant & Floret (1993: 316)] implies that p'th 
powers of /2-distances, for p < 2, never give rise to this problem, and that 
the same applies if a distance can be represented as a sum of such distances; 
thus, for instance, Hamming distance (number of mismatches) can be used for 
molecular data, and can be added to Euclidean distances between morphometrical 
characters. 

Computer programs, written in R, for performing both estimation and simu- 
lation from the estimated model, can be obtained from the authors. 

3 Example 

The procedure is illustrated by application to data in Minder et al. (2005), with a 
regression of testis size y as a function of mean hind tibia length (HTL) x^ , sper- 
mathecal duct length and spermathecal area a;® in 15 species of Scathophagi- 
dae (Diptera; true flies) [Table 1]. In the paper above, a corresponding analy- 
sis was made using the comparative analysis by independent contrasts program 
(CAIC) (Purvis & Rambaut 1994) to correct for the phylogeny, which was de- 
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duced from that of Bernasconi et al. (2000), itself derived from inter-specific 
differences in the sequence of 810 mDNA letters coding for the COI gene. Here, 
we look only at the 15 species considered by Minder et al. (2005). 

We consider three evolutionary distance matrices d. The first, d^\ is derived 
from the phylogeny depicted in Bernasconi et al. (2000: Fig. 1, 313), with the 
distance d^ between species % and I represented by the level in the tree at which 
their phylogenies merge (leaves at level 0, nearest neighbours at distance 1, etc.). 
This tree was rather carefully constructed from the COI data, using information 
about the positions of codons relative to the reading frame, A-T richness, and 
so on. Our second evolutionary distance matrix d^ 2 ' is much cruder, being based 
solely on the numbers of mismatches d^ between the COI sequences for species 
i and / [Table 2]: we set 

4 2) = -log(l-4 1} /105). (7) 

This matrix is chosen merely to reflect the fact that, as with most evolutionary 
models, the proportion of mismatches converges to a limit exponentially fast as 
evolutionary distance increases. Here, it is assumed that the limiting proportion 
of mismatches is 105/810, which is probably rather small in the context, but 
serves to exaggerate any effect caused by the non-linearity of the proportion of 
mismatches as a function of time. The third matrix d^ that we consider is that 
of the model in which the errors are independent and identically distributed, so 
that dtf = oo for all i ^ I; this, however, can be obtained also as the limit as 
A — ► oo of the two previous models. 

The procedure described in Section 2, with the correlation matrix C(A) cal- 
culated for each given value of A by substituting the evolutionary distance ma- 
trix d^ into Q, shows that x^ 2 \ spermathecal duct length, has no appreciable 
influence on y. Leaving out this covariable, the log-likelihood is maximized at a 
value of 27.17, with A = 0.0714 and with structural parameter estimates 

= 0.2353 and /3 (3) = 24.13, (8) 

and with f 2 = 0.00912. The standard deviations of (3^ and (3^, as calculated 
from C(A) and f 2 , are 0.078 and 8.05 respectively, and their estimated correla- 
tion is —0.294 (cf. Younger (1985: Sections 11.5-11.8)). The approximate 95% 
confidence interval for A calculated according to © was [0,0.6]. 

We describe the variability and dependence implied by the estimated model 
by first evaluating a quantity MSD, the median over all pairs of species i and I of 

SD(i,/) := y/m{( £i - £l )*} = 4= J(l-e~^) } 

V A 

the standard deviation of the difference between the errors for species i and I, as 
calculated for the estimated model (see (|12|) below). MSD is thus a measure of 
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the typical variability to be expected in such differences. We then consider the 
values of 

RSD : = SD (z,/)/MSD. 

If dependence has little effect on the analysis (A large), then all such values are 
close to 1; if a pair (i, I) is strongly dependent, then RSD (i, I) is close to zero. For 
the analysis here, we consider a pair li) (Norellia striolata and N. spinimona) 
which are very closely related, and another, (22,^2) (Scathophaga sulla and S. 
furcata), which are moderately closely related, as examples. For the analysis just 
conducted, we find that 

MSD = 0.2711, RSD (i u h) = 0.3460 and RSD (i 2 , l 2 ) = 0.7002. 

The maxima of the log-likelihood for the submodels obtained by omitting 
either x^- 1 ' or x^ are both substantially more than 2 smaller than that obtained 
above (differences 3.47 and 3.24 respectively), indicating that, at the 5% level, 
neither submodel should be preferred to the model estimated above. This means 
that hind tibia length (HTL) and spermathecal area are associated with testis 
size across the Scathophagidae after phylogenetic control using S l \ Minder et 
al. (2005) concluded that testis size was related to spermathecal area and sper- 
mathecal duct length, but not with HTL, after phylogenetic control. Our analysis 
supports the relationship with spermathecal area, making this a robust conclu- 
sion; especially since Minder et al. (2005) also found it with a species comparison. 
The relationship with HTL is intuitively appealing, as the simplest expectation is 
that, when a species gets larger, so do all of its body parts. Since the testis area 
to spermathecal duct length relationship is not robust in the different analyses, 
some caution must be exercised when considering the relationship reported in 
Minder et al. (2005). 

In order to judge the validity of the procedure, and to obtain an alternative 
assessment of the variability in the estimates, the model (8) and the correlation 
structure C(X) for the errors were held fixed, and data from this model distri- 
bution were simulated l'OOO times. The estimation procedure was then applied 
individually to each of the l'OOO resulting sets of data. These led to mean values 

/?« = 0.2353, and P ] = 24.14 (9) 

for the structural parameter estimates, with empirical standard deviations of 
0.0855 and 8.00 and an empirical correlation of —0.201 for the l'OOO estimates 
of f3 {l) and /3 (3) . The mean values p w and /3 (3) in are well in accord with 
the regression parameters and (3^ of the model (|8*|). as are the empirical 
standard deviations of the estimates and (3^ with the values calculated from 
C(A) and f 2 . 

The estimates of variability and dependence in the simulated data were by no 
means as stable. This is not particularly surprising, in view of the small num- 
ber (15) and large variability (MSD = 0.2711, as compared with estimated effects 
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/?WxSD(xW) = 0.0912 and x8D (x®) = 0.0524) of the observations. In just 
over 40% of the simulations, A was estimated to be zero, and in a further 5% to be 
infinity; the median value was 0.05, close to the actual value 0.0714, and the 90% 
confidence interval was [0, 0.69]. For the quantity MSD, the empirical mean over 
the l'OOO simulations was 0.2303, rather lower than the true value of 0.2711, and 
the empirical standard deviation was 0.0591. The negative bias in the empirical 
mean is to be expected, just as in the classical model with independent errors, 
where maximum likelihood makes no allowance for the fitted degrees of freedom 
when estimating the standard deviation. The empirical standard deviation of the 
MSD values is very much what would be expected when estimating a standard 
deviation from only 15 observations. The values of RSD had a minimum 

value of 0.2966, taken when A was estimated to be zero, and were thus heavily 
skewed; the median was 0.3288, not far from the true value of 0.3460, and its 90th 
percentile was at 0.7000. Analogously, the values of RSD (z 2 , h) had a minimum 
of 0.6567, a median of 0.7002, to be compared with the true value of 0.7223, and 
a 90th percentile at 0.9833. Thus, despite the wide variability in the estimated 
values of A, the results of the simulations as regards variability and dependence 
still showed a reasonable consistency. 

The same analyses can also be conducted with correlations based on the dis- 
tance matrix d^ 2 \ The results are broadly the same; the covariable x^ 2 ' is im- 
mediately dropped, and the model with and x^ has likelihood more than 2 
larger than that of either of the models with just one covariable. The parameter 
estimates in this model are 

= 0.2068 and /3 {3) = 24.11, 

with estimated standard errors of 0.0929 and 8.13, respectively, all of which are 
reasonably consistent with (|8*|). For the estimated error structure, we have 

MSD = 0.2558, RSD (i u h) = 0.3947 and RSD (i 2 , l 2 ) = 0.8056, 

the last two values indicating rather weaker evolutionary dependence than that 
found using S l \ but not outstandingly so. The main differences between the 
results with S 1 ^ and are that the maximum of the log likelihood using 
(at 25.57) is smaller by 1.6 than that for dP~\ suggesting that using the cruder 
matrix S 2S} leads to a somewhat less good fit, and that the decision to keep a^ 1 -* in 
the model is based on a likelihood difference of 2.12, quite a bit smaller than that 
obtained using S l \ indicating that the less good fit indeed entails some loss of 
precision, though not enough to change any of the main conclusions. The results 
of simulations confirmed the reliability of the procedure using S 2 ^ 1 in much the 
same way as it did when using 

If phylogenetic correlation is entirely neglected, and the model with inde- 
pendent and identically distributed errors is used, a significant loss of precision 
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is observed. The variable is still immediately rejected, and the structural 
parameters are estimated by 

= 0.2065 and (3 {3) = 19.82, 

fairly much as before; variability is estimated by MSD = 0.2674 (here correspond- 
ing to a residual standard deviation of around 0.19), and all RSD-values are 1. 
This model, however, has a log likelihood of only 23.53, more than 2 smaller 
than either of the two models fitted using phylogenetic correlation, and would 
therefore be rejected in comparison to them. Furthermore, under independence, 
the model with just has log likelihood 22.42, only 1.1 smaller, corresponding 
to a two-sided P- value of about 14%. This might suggest that should also 
be omitted from the model; however, the alternative for the effect of x^ is clear 
and one-sided, and the relevant P-value is more properly about 7%, giving weak 
support for retaining x^ in the model. Nonetheless, if phylogenetic correlation 
were neglected, there could be a danger that the effect of HTL on testis size would 
be missed. 

4 Discussion 

The method that we propose for conducting inter-species regression analyses is 
developed from the engagingly simple idea of using likelihood-based methods in 
conjunction with a stationary Ornstein-Uhlenbeck model of evolution. The result 
is a procedure which can be carried out without knowing the complete phylogeny 
- a measure of the evolutionary distance between pairs of species suffices — and 
which, at the same time, assesses the importance of inter-species relationship for 
the analysis. It is thus rather surprising that these advantages have not been 
emphasized earlier. 

Our example illustrates that the method works much as expected when ap- 
plied to a 'typical' biological data set. Although the detailed correlation structure 
was not reliably estimable, because there were few data points and a low signal 
to noise ratio, this still did not prevent the regression coefficients and their preci- 
sions being successfully estimated. Indeed, the procedure performed very well in 
a number of respects. It estimated the regression parameters and the variability 
of these estimates satisfactorily; it highlighted the extent to which the highly 
variable data did not support accurate estimation of the underlying dependence 
structure; and it indicated that, with these data, replacing the phylogeny with a 
crude estimate of the inter-species distances had little impact on the final con- 
clusions. 
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Appendix 

4.1 The O-U model. 

The O-U model for the error structure can be constructed as follows. We begin 
with a phylogeny consisting of a tree T with a root 0, n leaves and a set E of edges. 
The length of an edge e is denoted by £(e), and the distance from its rootward 
node to the root by t(e). There is a unique path Pj = (en, e i2 , ■ ■ ■ , e imi ) from 
the root to each leaf i in T, 1 < i < n, with = t(en) < t(e i2 ) < • ■ • < t(e imi ), 
with t(eij) = t(ejj_i) + £(ejj_i) for each 2 < j < rrii, and with leaf i at distance 
doi = t(e imi ) + £(e imi ) from the root. 

An OU (r 2 , A)-process X has the property that, for any s,u > 0, the condi- 
tional distribution of X(s + u) given X(s) = x is that of xe~ Au + X (m), where X 
is an OU (r 2 , A)-process with X (0) = 0, and thus X (u) is normally distributed 
with mean and variance <r 2 (l — e~ 2Xu ), where we write 



for the equilibrium variance. Hence the O-U error model on the tree can be 
constructed by associating with each edge e an independent normally distributed 
random variable Z(e) with mean and variance a 2 (l — e~ 2A£<6 )), and by then 
defining the error at leaf i to be 



where denotes the value of the error at the root. In our formulation, in which 
the O-U process stationary, is an independent normal random variable with 
mean and variance a 2 . 

It is now simple to deduce from (jlUj) and from the independence of the Z(e)'s 
that JEei = for all i and that (as has to be, because of stationarity) 



since the sum telescopes because i(e^) = t(ejj_i) +£(eij-i), and since t(en) = 
and d 0i = t(e imi ) + £(e inii ). For the covariances, we similarly have 



Cov ( £i ,ei) = e- A(d0l+doi) Var{Z (0) } + ^ Var {Z( eij )}e" A(doi+do ^ 2{ * (e ^ )+£(eij)}) , 



a 2 := r 2 /(2A) 




(10) 



Var Si 



e~ 2Adw Var{Z (0) } + ^ Var {Z(e.y)}e" 2A(do! -' (eij) ~^ )) 
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where 

Pi H Pi = (en, . . . , e ima ) = (en, . . . , ei ma ) 
is the overlap between the paths leading from the root to i and I. Hence 



COV (e h El) 



K 3=1 

cr 2 g- A {( d Oi-t(e lmi , )-i(e im . t ))+(d i -t(e lmu )-l(e lmu ))} 
(T 2 e -A(d,+ ( i i ) = a 2 e -Xd a 



-\(d 0i +d 0l ) J -I , V^r-, _ -2A«(ey)-| 2A(t(ey)+%y)) 




(11) 



where 



di — doi — t(ei ma ) 



£(e imu ) and di = d ot - t(ei ma ) - £(ei mu ), 



and du = di + di is the tree-distance from i to I, since ei mil = ei ma and this is the 
last common edge in the paths Pi and Pj. Equation (jHJ) now follows immediately. 

There is a non-trivial limit as A — > 0. In this limit, each Z(e) is normally 
distributed with mean and variance £(e)r 2 , so that the joint conditional distri- 
bution of the £j's given any fixed value Zq of is the same as for the Brownian 
model of evolution with infinitesimal variance r 2 and having value z$ at the root. 
Equivalently, one can check that the covariance structure of the OU (r 2 , A) model, 
restricted to the space of linear combinations of — e, 1 < i < n, converges to 
that of the Brownian model on the same space, where e := n~ x *YTi=\ £ ii irrespec- 
tive of the value of zq. Note that this limiting model does not have an equilibrium 
distribution. 

Instead of taking Z^ to have the stationary distribution in the O-U model, 
Blomberg et al. (2003) suppose that VarZ^ = 0, so that Z^ is considered to 
be fixed at some (unspecified) value Zq. Thus their formulae for variances and 
covariances differ from those above, in that the first term in each sum is lost, 
giving, in our notation, 



so that the root to leaf distances also enter their formulae. To this added compli- 
cation comes the problem of the means; they now have E'(e$) = Zoe~ Xdoi . Thus 
the usual linear model analyses, conducted on the assumption that errors have 
zero mean, are inconsistent with their formulation (at least, if the d^s are not all 
equal and if < A < oo) unless z$ is fixed to be zero. Hence their formulae are in 
general only valid if a time earlier than the first split in the phylogeny is known, 
at which the error is known to be exactly 0, and if this time is then taken to be the 
root. This seems to be rather an unlikely circumstance, and there is certainly no 
way of inferring the value of such a time from an inter-species distance matrix. 
Hence the stationary O-U model is much to be preferred; it pre-supposes merely 
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that OU (r 2 , A) -style evolution had already been taking place for a reasonable 
length of time before the first split in the phylogeny. 

Butler & King (2004), who are principally interested in more detailed mod- 
elling of adaptive evolution, have a very ingenious approach to Z^°\ They treat 
its value as a parameter of the model, to be estimated along with r 2 and A. This 
approach again has the disadvantage of including more elements of the phylogeny 
in the formulae. It is also not clear that inference about /3^\ . . . , (3^ would re- 
main invariant using their approach, if the root were moved further into the past 
from the time of the first split in the phylogeny, while leaving the rest of the phy- 
logeny unchanged; this should, however, logically be the case. In view of these 
considerations, our simpler model would seem to be preferable here also. 



4.2 Estimation as a function of A. 



To illustrate how the model estimated from data varies with the choice of A, 
consider the case in which there are no covariates, so that yt = [l + £j. Then the 
statistic 

^ n ^ n n 

s2 ■= — x £fa - y? = ^3iy £ £(w - w) a . 

i=l v ' i=l 1=1 

for a model with independent and identically distributed errors, is a natural 
estimator of the common variance of the e^'s. For the O-U error model, we have 

E{( £ , - et) 2 } = 2a 2 (l - e- Xd ») = j(l - e~ Xd "), (12) 
from ([lip, so that S 2 estimates t 2 D\, where 



1 n n 

2n(n -1) A V 
' i=i i=i 



Hence, for given A, a reasonable (but in general not optimal) estimator of r 2 is 
given by 

fl-=S 2 /D x . (13) 

If A is very large, then D\ ~ 1/(2A), and it thus follows from (|13|) that S 2 
estimates the equilibrium variance a 2 = r 2 /(2A), as is to be expected close to 
the model of independent errors. Note, however, that S 2 is a fixed function of 
the data, so that, from ([T5[l. the sequence of models estimated as A increases has 
t\ ~ 2A5 2 growing to infinity linearly with A. Thus, in this limit, the rate of 
random disturbances estimated from a fixed set of data tends to infinity. 

If, on the other hand, A — > 0, then D\ increases to its maximal value of 



v > i=i i=i 
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and t 2 is estimated by S 2 /Dq, as appropriate for the Brownian model. 

In the limit as A — > 0, the curiosity is rather the limiting value of 1 for 
Corr (si, ei), as implied by (JBJ). The reason for this is as follows. Suppose that X, 
V\ and V 2 are independent random variables, and that X\ := X+Vi, X 2 := X+V2. 
Then 

Corr(Xi,X 2 ) = Var X / ^(Var X + Var V x ) (Var X + Var V 2 ) 

where r^- = Var Vj/VarX for j = 1, 2. If now Var Vi and Var V 2 remain fixed, but 
VarX — ► 00, it follows that Corr (Xi,X 2 ) — > 1; and this despite the fact that 

E{(Xi - X 2 ) 2 } = {EVi - ~EV 2 } 2 + Var V x + Var V 2 

remains constant. Comparing this setting with that of (@J), it follows that the 
errors £j and £j for species i and I have correlation 1, in the limit as A — > 0, 
only because they inherit the same element Xo from their common ancestor 
species, whose variance a 2 = r 2 /(2A) tends to infinity as A — > 0. In particular, as 
implied by ()12|). E{(£j — £;) 2 } remains bounded away from zero and infinity as 
A — ► 0, so that the random variability between species does not disappear, even 
though Corr(£i,£j) — > 1. Indeed, the expression (fT2|l for E{(£, — Si) 2 } provides 
a consistent basis for comparing variability and strength of dependence across 
different models, and is used as such in Section 3: c.f. also Matheron's (1965) 
variogram. 



4.3 Departures from equilibrium. 

The analysis proposed in this paper supposes that the underlying O-U error pro- 
cess is in equilibrium. This is not a problem unless, as noted in Section 2, values 
of A are considered which are so small that it is doubtful whether equilibrium 
could ever have been reached. However, the value A = (for which there can be 
no equilibrium) represents the well-tried Brownian model of evolution, and it is 
therefore reasonable to ask how our procedure behaves for A very close to 0. 

Since we base our analysis only on the centred variables yi —y, it is enough to 
understand what happens for differences of errors £j — E\ between pairs of species. 
So suppose that the root value in (fTUj) is not at equilibrium, but instead has 
a normal distribution with mean \i and variance v given by 

" : ^(^) e " D and " : =G) (i - e " 2iD); 

this represents an initial value for the error in the ancestor species of s standard 
deviations away from zero, at an epoch D units of evolutionary time prior to the 
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root. Then it is easy to calculate 



E(£j - e{) = E{Z (0) (e- Ad(M -e~ Xd ° 1 )} 



2A 

where Du = mm{d 0i , doi}, 5u = \doi — doi\ < da, and doi represents the evolution- 
ary time from the root until species %. Hence, writing An := D + Dn for the time 
from the initial epoch until the % and / ancestries split, we have 



m{E i -E l ) = s[-f=) e~^(l-e-^), (14) 



to be compared with the equilibrium standard deviation of Si—ei which, from ()12j) . 
is given by 



SD (*,/):= (-^ y/l-e-^. (15) 



The ratio of these two quantities is thus 

s e~ XAa (I - e" x5a ) 



■ A), 



2 y/i - e-Adii 2 

say, where 

r{i, Z; A) < {e- AA « \fh/A~i. 

Thus there is no mean effect if the branch lengths to i and / are equal (5u = 0), or 
if A = 0, or if A = oo, or if s = 0; more generally, the effect is small if AAj; is either 
small or large, and the largest effect possible, occurring when A = 1/(2A^), is 
t^t^V Oaj 'An. Hence, if the differences in branch lengths are much smaller than 
the times from the initial epoch until species diverge, there can be no appreciable 
mean effect. 

For the variability, the conclusions are entirely analogous. It is straightforward 
to compute 

Var{e 4 -^}-{SD(M)} 2 = (£) e~ 2XA « (1 - e~ x5 -) 2 , 

and the ratio of this quantity to {SD (z, I)} 2 , the relative error in the variance in- 
duced by assuming the process to be in equilibrium, gives the value |{r(i, Z; A)} 2 . 
Once again, there is no correction to be made if the branch lengths to % and I 
are equal (5u = 0), or if A = 0, or if A = oo; and the largest effect possible is 
■^{5u/ An}. Hence, if the differences in branch lengths are much smaller than 
the times from the initial epoch until species diverge, there can be no appreciable 
effect on the variances, either. 
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Table 1: Testis size, mean HTL, duct length and spermathecal size for 15 species 
of Scathophagidae: data from Minder et al. (2005). 

Species Testis size mean HTL Duct Length Spermathecal Area 





(mm J 


(mm) 


(mm) 


(mm ) 




y 


x y ' 


x v ; 


~(3) 


Cordilura albipes 




Z.41U 


.Do4 


.UU4yU 


Cleigastra apicalis 


.U lo 


z.UoU 


/1 1 o 
.41Z 


.UU lv>\) 


Cordilura ciliata 


/IOC 

A6o 


o onn 
o.zyU 


.t>U4 


m *7A 

.Ul (4:6 


Cordilura pubera 


.662 


1 *77£ 


. tit 


.UlOl 1 


Microprosopa pallidicauda 


A 77 

Alt 




CO 1 

.Dot 


.UU lyo 


Norellia liturata 


.382 


2.095 


.962 


.02497 


Norellia spinimana 


.547 


2.295 


1.086 


.02048 


Norellia striolata 


.855 


3.110 


1.384 


.02397 


Phrosia albilabris 


.319 


2.380 


.519 


.01485 


Scathophaga cineraria 


.486 


2.750 


.561 


.01046 


Scathophaga furcata 


.965 


2.430 


.541 


.02195 


Spaziphora hydromyzina 


.134 


2.000 


.235 


.01049 


Scathophaga stercoraria 


.544 


2.815 


.672 


.01044 


Scathophaga suilla 


.461 


2.380 


.386 


.01002 


Scathophaga taeniopa 


.699 


2.695 


.479 


.01347 
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Table 2: Numbers of differing pairs between the sequences of 810 mDNA letters 
coding for the COI gene in 15 species of Scathophagidae: original sequences from 
Genbank. 



abcde fgh 


i 


j 


k 


1 m 


n 


o 




Species 


* 70 64 69 72 73 87 83 


62 


67 


75 


82 63 


65 


70 


a: 


Cordilura albipes 


* 92 93 64 69 76 75 


81 


60 


71 


74 55 


54 


61 


b: 


Cleigastra apicalis 


* 63 80 77 88 88 


71 


73 


68 


89 71 


72 


77 


c: 


Cordilura ciliata 


* 81 83 99 96 


75 


84 


88 


96 81 


81 


85 


d: 


Cordilura pubera 


* 66 79 73 


82 


63 


74 


55 52 


59 


63 


e: 


Microprosopa pallidicauda 


* 67 65 


77 


59 


70 


70 58 


57 


63 


f: 


Norellia liturata 


* 9 


95 


67 


80 


74 69 


72 


79 


g: 


Norellia spinimana 


* 


94 


64 


77 


71 64 


68 


75 


h: 


Norellia striolata 




* 


69 


78 


83 68 


70 


72 


i: 


Phrosia albilabris 






* 


36 


62 35 


20 


30 


j: 


Scathophaga cineraria 








* 


73 43 


41 


45 


k: 


Scathophaga furcata 










* 56 


56 


62 


1: 


Spaziphora hydromyzina 










* 


29 


38 


m: 


Scathophaga stercoraria 












* 


22 


n: 


Scathophaga suilla 














* 


o: 


Scathophaga taeniopa 
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